API: Reformat output of groupby.describe (#4792) #15260

Closed
wants to merge 2 commits into from

Conversation

mroeschke
Member

It doesn't look like this was addressed by a PR in 0.20, but the original issue works on master.

'VOLUME': volumes})
result = df.groupby('PRICE').describe()
expected_index = pd.MultiIndex(levels=[[24990, 25499],
['count', 'mean', 'std',
Contributor

Instead of constructing the index this way, use concat with keys on the subframes.
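
A minimal sketch of the suggestion above, not the test that ended up in the PR. It assumes a small frame whose `PRICE` and `VOLUME` values are illustrative (they mirror the output quoted later in this thread). Each sub-frame is described on its own, and `pd.concat` with `keys` builds the two-level index, so the `MultiIndex` levels do not have to be spelled out by hand.

```python
import pandas as pd

# Illustrative data; the values mirror the output quoted later in this thread.
df = pd.DataFrame({'PRICE': [24990, 25499, 25499],
                   'VOLUME': [1500000000, 5000000000, 100000000]})

# Describe each sub-frame independently of groupby.
prices = [24990, 25499]
pieces = [df.loc[df['PRICE'] == p, ['VOLUME']].describe() for p in prices]

# `keys` becomes the outer index level, so no manual MultiIndex is needed.
expected = pd.concat(pieces, keys=prices)
expected.index.names = ['PRICE', None]
print(expected)
```

This produces the stacked (pre-change) layout, with one row per (price, statistic) pair.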

@jreback jreback added Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode Testing pandas testing functions or related to the test suite labels Jan 30, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 30, 2017
@jreback
Contributor

jreback commented Jan 30, 2017

Actually, I think the bug is still present. The issue is that you want this:

In [10]: df.groupby('PRICE').VOLUME.describe().unstack(1)
Out[10]: 
       count          mean           std           min           25%           50%           75%           max
PRICE                                                                                                         
24990    1.0  1.500000e+09           NaN  1.500000e+09  1.500000e+09  1.500000e+09  1.500000e+09  1.500000e+09
25499    2.0  2.550000e+09  3.464823e+09  1.000000e+08  1.325000e+09  2.550000e+09  3.775000e+09  5.000000e+09

rather than this

In [9]: df.groupby('PRICE').VOLUME.describe()
Out[9]: 
PRICE       
24990  count    1.000000e+00
       mean     1.500000e+09
       std               NaN
       min      1.500000e+09
       25%      1.500000e+09
       50%      1.500000e+09
       75%      1.500000e+09
       max      1.500000e+09
25499  count    2.000000e+00
       mean     2.550000e+09
       std      3.464823e+09
       min      1.000000e+08
       25%      1.325000e+09
       50%      2.550000e+09
       75%      3.775000e+09
       max      5.000000e+09
Name: VOLUME, dtype: float64

we do a similar unstack already with .ohlc.

In [8]: df.groupby('PRICE').VOLUME.ohlc()
Out[8]: 
             open        high         low       close
PRICE                                                
24990  1500000000  1500000000  1500000000  1500000000
25499  5000000000  5000000000   100000000   100000000

So each group gets a single row, while multi-level columns are present for multiple aggregations, and a multi-index is present for multiple groupers.
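
To make the relationship between the two layouts concrete, here is a small sketch. The frame is an assumption reconstructed from the outputs above, and `apply` with `Series.describe` stands in for the stacked result so that the example does not depend on which layout `groupby.describe()` itself returns in a given pandas version.

```python
import pandas as pd

# Assumed data, reconstructed from the outputs quoted above.
df = pd.DataFrame({'PRICE': [24990, 25499, 25499],
                   'VOLUME': [1500000000, 5000000000, 100000000]})

# Stacked layout: one row per (group, statistic) pair.
stacked = df.groupby('PRICE')['VOLUME'].apply(lambda s: s.describe())

# Wide layout: one row per group, one column per statistic.
wide = stacked.unstack()
print(wide)
```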

@jreback jreback removed the Testing pandas testing functions or related to the test suite label Jan 30, 2017
@mroeschke
Member Author

Ah that makes sense. Thanks for clarifying the expected output!

I've been poking into the code, and since each group goes through apply() and describe() returns the metrics labeled on the index, it tries to vertically concat the describe() results for each group:

(Pdb) values
[             VOLUME
count  1.000000e+00
mean   1.500000e+09
std             NaN
min    1.500000e+09
25%    1.500000e+09
50%    1.500000e+09
75%    1.500000e+09
max    1.500000e+09,              
              VOLUME
count  2.000000e+00
mean   2.550000e+09
std    3.464823e+09
min    1.000000e+08
25%    1.325000e+09
50%    2.550000e+09
75%    3.775000e+09
max    5.000000e+09]

I could add some logic saying if the indexes for each group are the same, concat on the index and transpose to the columns, but I think that'd be a pretty big change since it will probably affect all groupby.apply() functions. Or should we make a special case for describe()? Thoughts @jreback?
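
The "concat on the index and transpose" idea can be sketched outside of the groupby internals. This is only an illustration under assumed data (values taken from the outputs above), not the logic that `groupby.apply()` uses.

```python
import pandas as pd

# Assumed data, reconstructed from the outputs quoted above.
df = pd.DataFrame({'PRICE': [24990, 25499, 25499],
                   'VOLUME': [1500000000, 5000000000, 100000000]})

# One describe() Series per group, keyed by the group label.
per_group = {price: grp['VOLUME'].describe()
             for price, grp in df.groupby('PRICE')}

# Every Series shares the same index (count, mean, ...), so concatenating
# along the columns aligns them; transposing puts the groups on the rows.
wide = pd.concat(per_group, axis=1).T
wide.index.name = 'PRICE'
print(wide)
```

The alignment only works this cleanly because every group yields the same statistic labels, which is the condition mentioned above.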

@jreback
Contributor

jreback commented Jan 30, 2017

So you just need to define .describe as a method on DataFrameGroupBy and SeriesGroupBy; then you can do the .apply (followed by the unstack).

Right now we automagically just do any whitelisted function (including .describe) and they go thru the standard reshaping things. If you can figure out a generic way to do this great, but otherwise defining a method is fine.

@mroeschke mroeschke changed the title TST: groupby.describe levels don't appear as column (#4792) [WIP] : Reformat output of groupby.describe (#4792) Jan 31, 2017
@mroeschke
Member Author

Cool, thanks for the insight.

I defined a new method for groupby.describe() and noticed in the process that when groupby(..., axis=1).describe() is called, transposing (rather than unstacking) returns the desired result. I also had to change a lot of the existing tests since the output has changed. Most notably, the test for #14848 changed a lot.

@@ -366,6 +366,7 @@ Other API Changes
- ``inplace`` arguments now require a boolean value, else a ``ValueError`` is thrown (:issue:`14189`)
- ``pandas.api.types.is_datetime64_ns_dtype`` will now report ``True`` on a tz-aware dtype, similar to ``pandas.api.types.is_datetime64_any_dtype``
- ``DataFrame.asof()`` will return a null filled ``Series`` instead the scalar ``NaN`` if a match is not found (:issue:`15118`)
- ``groupby.describe()`` now labels the `describe()` metrics in the column instead of the index (:issue:`4792`)
Contributor

I think this should move to a sub-section that shows the previous and current behavior.

@@ -1140,6 +1139,17 @@ def ohlc(self):

@Substitution(name='groupby')
@Appender(_doc_template)
def describe(self, **kwargs):
"""
Provide summary statistics for each group, excluding NaN values
Contributor

pls add a full doc-string, with named arguments. You might be able to simply add Series.describe and DataFrame.describe in the Notes section.

Contributor

or as you do below just use the DataFrame doc-string

@mroeschke mroeschke force-pushed the fix_4792 branch 2 times, most recently from 4b5d367 to 4375710 Compare February 3, 2017 07:17
@mroeschke
Member Author

Added previous/current behavior in whatsnew, documentation to describe with DataFrame.describe.__doc__, and fixed other failing tests.

^^^^^^^^^^^^^^^^^^^^^^^^^^^

The output formatting of ``groupby.describe()`` now labels the ``describe()`` metrics in the columns instead of the index.
This format is consistent with ``groupby.ohlc()`` (:issue:`4792`)
Contributor

More to the point, it's consistent with how .agg() works.

Contributor

most people prob don't know about .ohlc() :>


New Behavior:

.. code-block:: ipython
Contributor

use an ipython-block here (so the code executes)

expected.index.names = ['A', None]
expected = pd.concat([(df[df.A == 1].B
.describe()
.to_frame()
Contributor

This seems like lots of reshapings (this is current master)

In [6]: df
Out[6]: 
   A    B    C
0  1  2.0  foo
1  1  NaN  bar
2  3  NaN  baz

In [7]: df.groupby('A').describe()
Out[7]: 
           B
A           
1 count  1.0
  mean   2.0
  std    NaN
  min    2.0
  25%    2.0
  50%    2.0
  75%    2.0
  max    2.0
3 count  0.0
  mean   NaN
  std    NaN
  min    NaN
  25%    NaN
  50%    NaN
  75%    NaN
  max    NaN

In [8]: df.groupby('A').describe().unstack()
Out[8]: 
      B                                  
  count mean std  min  25%  50%  75%  max
A                                        
1   1.0  2.0 NaN  2.0  2.0  2.0  2.0  2.0
3   0.0  NaN NaN  NaN  NaN  NaN  NaN  NaN

Member Author

In this test, result is effectively df.groupby('A').describe().unstack() now that unstack() has been added to groupby.describe(). Shouldn't expected follow a path independent of result?

Contributor

yes, but you can simply construct this result directly (as it's 'simple' enough), just pd.DataFrame(.....)
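
A sketch of what such a direct construction might look like for the frame shown above. This is an assumption about the shape of the expected value, with the numbers copied from the `unstack()` output in this thread; it is not the code that was committed to the test.

```python
import numpy as np
import pandas as pd

stats = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Values copied from the unstacked output above: group 1 has a single
# non-null B value (2.0); group 3 has none, so only its count is defined.
expected = pd.DataFrame(
    [[1.0, 2.0, np.nan, 2.0, 2.0, 2.0, 2.0, 2.0],
     [0.0] + [np.nan] * 7],
    index=pd.Index([1, 3], name='A'),
    columns=pd.MultiIndex.from_product([['B'], stats]),
)
print(expected)
```

Building `expected` from literals keeps it independent of the reshaping code under test.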

Member Author

Ah I see, thanks! Agreed that my first edit probably included too much reshaping.

@mroeschke mroeschke changed the title [WIP] : Reformat output of groupby.describe (#4792) API: Reformat output of groupby.describe (#4792) Feb 6, 2017
@mroeschke
Member Author

Replaced the API example with one using groupby.agg() instead of groupby.ohlc(), fixed the code block, and simplified expected in test_non_cython_api.

@jreback
Contributor

jreback commented Feb 7, 2017

Hmm, something is wrong here. It's including the grouper.

In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [2]: df.groupby('A').describe()
Out[2]: 
      A                                        B                                          
  count mean  std  min  25%  50%  75%  max count mean       std  min   25%  50%   75%  max
A                                                                                         
1   2.0  1.0  0.0  1.0  1.0  1.0  1.0  1.0   2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0
2   2.0  2.0  0.0  2.0  2.0  2.0  2.0  2.0   2.0  3.5  0.707107  3.0  3.25  3.5  3.75  4.0

In [3]: df.groupby('A').agg(['mean', 'std'])
Out[3]: 
     B          
  mean       std
A               
1  1.5  0.707107
2  3.5  0.707107
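
The consistency being asked for can be written down as a small check. This sketch assumes a pandas version in which `groupby.describe()` already returns the wide layout (0.20 or later, per the milestone on this PR); on the branch as it stood at this point in the review, the first check would fail.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

described = df.groupby('A').describe()
aggregated = df.groupby('A').agg(['mean', 'std'])

# The grouper 'A' should label the rows only, never a block of columns.
grouper_in_columns = 'A' in described.columns.get_level_values(0)
print(grouper_in_columns)

# The statistics shared by both results should agree.
same_values = np.allclose(described['B'][['mean', 'std']], aggregated['B'])
print(same_values)
```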

@mroeschke
Member Author

It looks like the grouper is included when using groupby.apply():

In [4]: pd.__version__
Out[4]: u'0.19.2' #not current master

In [5]: df.groupby('A').apply(lambda x: x.describe()) 
Out[5]:
           A         B
A
1 count  2.0  2.000000
  mean   1.0  1.500000
  std    0.0  0.707107
  min    1.0  1.000000
  25%    1.0  1.250000
  50%    1.0  1.500000
  75%    1.0  1.750000
  max    1.0  2.000000
2 count  2.0  2.000000
  mean   2.0  3.500000
  std    0.0  0.707107
  min    2.0  3.000000
  25%    2.0  3.250000
  50%    2.0  3.500000
  75%    2.0  3.750000
  max    2.0  4.000000

Can be seen with other functions as well:

In [11]: df.groupby('A').apply(np.mean) #not idiomatic but should be similar to  df.groupby('A').mean()
Out[11]:
     A    B
A
1  1.0  1.5
2  2.0  3.5

Is this a known issue?

@jreback
Contributor

jreback commented Feb 7, 2017

@mroeschke so by definition this is what apply does.

you can use ._set_group_selection() to avoid this problem.
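
`_set_group_selection()` is a private pandas helper, so the sketch below does not call it. As an assumption about the intended effect, it selects the non-grouping column explicitly through the public API, which keeps `A` out of the frames that `apply` passes to `describe()`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

# Selecting ['B'] up front means each group frame holds only column B,
# so describe() never sees the grouping column A.
stacked = df.groupby('A')[['B']].apply(lambda g: g.describe())

# Move the statistic labels from the inner index level into the columns.
wide = stacked.unstack()
top_level = wide.columns.get_level_values(0).unique().tolist()
print(top_level)
```

Inside the method itself the selection would presumably be handled internally, so callers would not need to name the columns.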

@mroeschke
Member Author

Ah thanks for clarifying that @jreback. Will edit to use ._set_group_selection() tonight.

Restructure describe def

Fix another test

Refactoring tests

linting & patch groupby tests

add whatsnew

fix docstring

fix more tests

Added api example and documentation to describe

fix potential pep8 complaint

adjust doc description

renamed original test and add agg example in doc

simplify example

Eliminate grouper from result

simplify example in the whatsnew
@codecov-io

codecov-io commented Feb 9, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@c23b1a4).

@@            Coverage Diff            @@
##             master   #15260   +/-   ##
=========================================
  Coverage          ?   86.32%           
=========================================
  Files             ?      141           
  Lines             ?    51177           
  Branches          ?        0           
=========================================
  Hits              ?    44180           
  Misses            ?     6997           
  Partials          ?        0
Impacted Files Coverage Δ
pandas/core/groupby.py 95.13% <92.85%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c23b1a4...618bc46.

@mroeschke
Member Author

Added _set_group_selection() to prevent the grouper from being included and resolved conflict with whatsnew

@jreback jreback closed this in 3d6fcdc Feb 10, 2017
@jreback
Contributor

jreback commented Feb 10, 2017

thanks!

keep em coming!

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#4792

Author: Matt Roeschke
Author: Matthew Roeschke

Closes pandas-dev#15260 from mroeschke/fix_4792 and squashes the following commits:

618bc46 [Matthew Roeschke] Merge branch 'master' into fix_4792
184378d [Matt Roeschke] TST: groupby.describe levels don't appear as column (pandas-dev#4792)
@mroeschke mroeschke deleted the fix_4792 branch December 20, 2017 02:04
Labels
Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

describe on a groupby
3 participants